Abstract. The group lasso is a penalized regression method, used in regression problems 
where the covariates are partitioned into groups to promote sparsity at the group level. 
Existing methods for finding the group lasso estimator either use gradient projection methods 
to update the entire coefficient vector simultaneously at each step, or update one group of 
coefficients at a time using an inexact line search to approximate the optimal value for the 
group of coefficients when all other groups' coefficients are fixed. We present a method 
of computation for the group lasso in the linear regression case, the Single Line Search 
(SLS) algorithm, which operates by computing the exact optimal value for each group (when 
all other coefficients are fixed) with one univariate line search. We perform simulations 
demonstrating that the SLS algorithm is often more efficient than existing computational 
methods. We also extend the SLS algorithm to the sparse group lasso problem via the 
Signed Single Line Search (SSLS) algorithm, and give theoretical results to support both 
algorithms. 



1. Introduction 



Consider a normal regression problem with response vector $y \in \mathbb{R}^n$ and covariate matrix
$X \in \mathbb{R}^{n \times p}$ that is decomposed into 'blocks' or 'groups', as $X = (X_1 \; X_2 \; \dots \; X_K)$. In this
linear regression setting, the group lasso estimator (Kim et al., 2006; Yuan and Lin, 2006) is
a coefficient vector $\beta \in \mathbb{R}^p$ which minimizes the objective function

(1)    $L(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{k=1}^{K} \|\beta_k\|_2$,

where $\lambda > 0$ is a penalty parameter. This penalized regression method addresses the problem
of model selection when the true model is believed to be 'groupwise sparse'; that is, when the
smallest true model might plausibly exclude some of the groups $\{X_1, \dots, X_K\}$ entirely. This
setting is found in many applications, since covariates are often naturally grouped in some 
manner. Each group might contain a number of factor levels of a single factor (for instance, 
in genetic data, the factor could indicate the presence of zero, one, or two copies of a rare 
allele), or might consist of a set of related quantitative variables. It has been shown that the
group lasso objective function can be applied to accurately and efficiently recover group-wise
sparse signals (Kim et al., 2006; Meier et al., 2008; Yuan and Lin, 2006), and that the group
lasso estimator shows asymptotic consistency even when model complexity grows with sample
size (Nardi and Rinaldo, 2008). The group lasso has been discussed in many settings aside
from normal regression, including logistic regression (Meier et al., 2008) and generalized linear
models (Kim et al., 2006). The group lasso has been applied to multivariate regressions, where
the response variables are expected to have similar or identical sparsity patterns and therefore
the matrix of coefficients is likely to be row-wise sparse, with asymptotic consistency results
(Obozinski et al., 2008). Similar objective functions have been proposed to handle a range of
settings, including the possibility of overlapping groups (Jacob et al., 2009), and multi-task
learning (Lounici et al., 2009; Obozinski et al., 2010).



We remark that it is common to include an unpenalized intercept term or other unpenalized
terms in the group lasso. However, such an objective function can be reduced to the form of (1)
by regressing out all unpenalized covariates from the response and the penalized covariates.
Additionally, there are many settings where we might wish to place different positive penalties
on the different groups, giving the more general objective function:

$L(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 + \sum_{k=1}^{K} \lambda_k \|\beta_k\|_2$.

For instance, penalties are scaled with the square root of group size to penalize larger groups
in Meier et al. (2008). However, rescaling the groups of covariates can transform this objective
function into the form of (1). Therefore, in this paper we only consider the case where each
group has equal positive penalty $\lambda$.
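The rescaling argument can be checked numerically. The following is a minimal sketch, assuming the rescaling $X_k \mapsto (\lambda/\lambda_k) X_k$ and $\beta_k \mapsto (\lambda_k/\lambda)\beta_k$; the function names `group_lasso_objective` and `rescale_to_common_penalty` are hypothetical and the data are random placeholders.

```python
import numpy as np

def group_lasso_objective(y, X_groups, beta_groups, penalties):
    """0.5 * ||y - sum_k X_k beta_k||^2 + sum_k penalties[k] * ||beta_k||_2."""
    fitted = sum(X @ b for X, b in zip(X_groups, beta_groups))
    penalty = sum(lk * np.linalg.norm(b) for lk, b in zip(penalties, beta_groups))
    return 0.5 * np.sum((y - fitted) ** 2) + penalty

def rescale_to_common_penalty(X_groups, beta_groups, penalties, lam):
    """Assumed rescaling: X_k -> (lam/lam_k) X_k and beta_k -> (lam_k/lam) beta_k."""
    X_new = [(lam / lk) * X for X, lk in zip(X_groups, penalties)]
    beta_new = [(lk / lam) * b for b, lk in zip(beta_groups, penalties)]
    return X_new, beta_new

# Random placeholder data with two groups and unequal penalties.
rng = np.random.default_rng(0)
y = rng.normal(size=20)
X_groups = [rng.normal(size=(20, 3)), rng.normal(size=(20, 2))]
beta_groups = [rng.normal(size=3), rng.normal(size=2)]
penalties = [0.5, 2.0]
lam = 1.0

X_new, beta_new = rescale_to_common_penalty(X_groups, beta_groups, penalties, lam)
original = group_lasso_objective(y, X_groups, beta_groups, penalties)
rescaled = group_lasso_objective(y, X_new, beta_new, [lam, lam])
```

The fitted values $X_k\beta_k$ are unchanged by the rescaling, and each penalty term $\lambda_k\|\beta_k\|_2$ becomes $\lambda\|(\lambda_k/\lambda)\beta_k\|_2$, so the two objective values agree.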

Recently, the sparse group lasso was proposed as an extension to the group lasso, placing
an additional penalty on the 1-norm of the coefficient vector (Friedman et al., 2010; Wu and
Lange, 2008). The objective function is then given by

(2)    $L_1(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 + \lambda_1 \sum_{k=1}^{K} \|\beta_k\|_2 + \lambda_2 \|\beta\|_1$.

As in the group lasso problem (1), we assume that the penalty $\lambda_1$ on group norms is positive;
we may also assume $\lambda_2 > 0$, since (2) reduces to (1) if $\lambda_2 = 0$. Friedman et al. (2010) argue
that this method is more appropriate when there is the possibility of within-group sparsity,
and show that optimizing the objective function does in fact recover both group-wise and
within-group sparsity in simulations. As with the group lasso, a sparse group lasso problem
with unpenalized covariates may be reduced to an objective function in the form of (2).

The solution to a group lasso or sparse group lasso problem is not necessarily unique. As
a simple example, consider the case of repeated groups, where $X_{k_1} = X_{k_2} U$ for two groups
$k_1 \neq k_2$ and some orthogonal matrix $U$. This may produce an infinite solution set. (Or, if the
penalties vary across the groups, we might have $\lambda_{k_1}^{-1} X_{k_1} = \lambda_{k_2}^{-1} X_{k_2} U$ for some orthogonal
$U$). However, the minimum of the objective function is always attained, and there is a unique
optimal vector of fitted values (denoted by $\hat{y}$ in this paper) and a unique penalty term value.
In other words, in the group lasso case (Roth and Fischer, 2008),

$\beta^1, \beta^2 \in B \;\Rightarrow\; X\beta^1 = X\beta^2 \ \text{ and } \ \sum_{k=1}^{K} \|\beta^1_k\|_2 = \sum_{k=1}^{K} \|\beta^2_k\|_2$,

where $B = \arg\min_\beta L(\beta)$. By analogous reasoning, the same is true for the sparse group lasso
case with objective function $L_1$. Furthermore we can say that the 'direction' of $\beta_k$ for each
group $k$ is unique. The precise meaning of this is explained in the following theorem:

Theorem 1. Let $B$ be the set of minimizers of the penalized likelihood $L$ for a group lasso
problem (1) or sparse group lasso problem (2). Then there exists a unique minimal set of
groups $\mathcal{K} \subset \{1, \dots, K\}$, and unique unit vectors $v_k \in \mathbb{R}^{p_k}$ for each $k \in \mathcal{K}$, such that

$\beta \in B \;\Rightarrow\; \beta_k \propto v_k \ \ \forall k \in \mathcal{K}, \quad \beta_k = 0 \ \ \forall k \notin \mathcal{K}$,

where we define $a \propto b$ to mean that $a = c \cdot b$ for some nonnegative scalar $c$.

Many advances have been made in recent years for efficient optimization of the group lasso
penalized likelihood function. The algorithms may be broken down into two broad categories:
group-wise descent, where each step updates one entire group of coefficients via an inexact
line search (Meier et al., 2008), and 'global' descent, where at each step the entire coefficient
vector could potentially be updated (Kim et al., 2006; Roth and Fischer, 2008). An efficient
approach to the corresponding online learning problem, developed by Yang et al. (2010),
handles the online versions of both the group lasso and sparse group lasso. Each step of the
online learning algorithm is very efficient, and the algorithm requires no precomputations.
Since online learning is a very different task from the offline convex optimization problem
which we seek to solve, we do not attempt to compare this algorithm to the others in our
simulations.

The main result of this paper is the 'Single Line Search' (SLS) algorithm for solving the 
group lasso problem. The efficiency of this method lies in the computation of the exact 
optimal value for the coefficients of any single group (fixing the other coefficients) via a single 
univariate line search, which corresponds to finding the 2-norm of the optimal coefficient 
vector for that group. We state several theoretical results, showing that each group's update 
is indeed optimal, that the algorithm converges to the minimum of the objective function, 
and that at any finite time in the algorithm the distance to convergence can be bounded 
in terms of the present subgradient norm. We present simulation results showing that this 
method performs faster on the group lasso problem than existing offline learning algorithms 
(including both group-wise and global descent algorithms).

Turning to the sparse group lasso problem, we extend the SLS algorithm to the Signed Single
Line Search (SSLS) algorithm, which handles the additional $\ell_1$ penalty on the coefficient
vector. This method is, to our knowledge, the only existing algorithm for solving the sparse
group lasso problem in its 'offline' form. The structure of the SSLS algorithm makes it practical
only when the groups' sizes are quite small; therefore, we discuss strategies for developing an
efficient algorithm to solve the sparse group lasso problem which is more flexible with respect
to group size. We also discuss possible extensions to the SLS algorithm, including extending
the algorithm to models other than linear regression, and adaptations to the algorithm which
may increase efficiency in extremely high-dimensional group lasso settings (as examined in
Roth and Fischer, 2008, for instance).

The remainder of the paper is structured as follows. In Section 2 we summarize existing
methods for computing the group lasso solution. In Section 3 we introduce the SLS algorithm
and the main related theoretical results. Results of simulations comparing the SLS algorithm
to existing methods are given in Section 4. In Section 5 we introduce the SSLS algorithm for
the sparse group lasso and give theoretical results, and also describe the existing algorithm
for solving the online version of the problem. Section 6 contains the discussion of our results
and of future directions. Unless otherwise noted, all theoretical results in the paper, including
Theorem 1 above, are proved in the Appendix (Section 7).

Since making this manuscript publicly available, we have been made aware of the earlier
work by Puig et al. (2009), which derives the same result for the (non-sparse) group lasso
setting. We leave this manuscript available as a technical report, to serve as a reference for
the previously untreated sparse group lasso case (the SSLS algorithm), and for the timing
comparisons of various methods in the group lasso setting.



2. Prior work 



In this section we outline prior work on the group lasso problem, consisting of both
group-wise descent and global descent methods.






RINA FOYGEL AND MATHIAS DRTON 



2.1. Group-wise descent. We first examine existing computations and methods for group-wise
descent. Since only one group at a time is being updated, we may restrict our attention
to the subproblem of finding $\beta_k$ to minimize

$\frac{1}{2}\|R_k - X_k\beta_k\|_2^2 + \lambda\|\beta_k\|_2$,

where $R_k$ is the remainder when all other coefficients are fixed, $R_k = y - \sum_{i \neq k} X_i\beta_i$. For
simplicity of notation, we change variables and write this subproblem as the minimization of

(3)    $Q(\alpha) = \frac{1}{2}\|b - A\alpha\|_2^2 + \lambda\|\alpha\|_2$,

where $b \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times q}$, $\lambda > 0$.

The objective function (3) is clearly convex in $\alpha$, and by Theorem 1 has a unique minimizer
$\hat\alpha$. A subgradient of $Q$ at $\alpha$ is any vector

$-A^T b + A^T A\alpha + \lambda s$,

where $s = \alpha/\|\alpha\|_2$ for nonzero $\alpha$, or may be any vector of up to unit norm when $\alpha = 0$
(Bertsekas, 1999; Meier et al., 2008). The subdifferential of $Q$ at $\alpha$, written $\partial Q(\alpha)$, is the set
of all subgradients at $\alpha$. Since $Q$ is convex, the subgradient condition for optimality shows
that $\alpha$ is optimal if and only if $0 \in \partial Q(\alpha)$.

It is clear from the known subgradient condition that

$0 \in \arg\min_\alpha Q(\alpha) \iff \|A^T b\|_2 \le \lambda$.

The zero case is therefore simple and we turn to the case that $\alpha = 0$ is not optimal.



Yuan and Lin (2006) give the solution to the subproblem in the case where the columns of
$A$ are orthonormal. In this case, examining the subdifferential shows that $\alpha \neq 0$ is optimal if
and only if

$0 = -A^T b + \alpha + \lambda\frac{\alpha}{\|\alpha\|_2} \iff \left(1 + \frac{\lambda}{\|\alpha\|_2}\right)\alpha = A^T b \iff \alpha = \left(1 - \frac{\lambda}{\|A^T b\|_2}\right) A^T b$.

Computing this last quantity is very fast, and we may use it in any setting to compute
a group-wise sparse regression by orthonormalizing each group of covariates $X_k$. However,
Friedman et al. (2010) raise the point that the resulting solution, when transformed back
to the original basis for each group, will not be a solution of the group lasso problem with
the original covariates. In many situations, orthonormalizing each group of covariates may
be unnatural or undesirable. Therefore, methods which do not require orthonormal $X_k$'s are
necessary.
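The orthonormal closed form can be sketched in a few lines. This is only an illustration under the assumption $A^T A = I$; the function name `orthonormal_group_update` is hypothetical, and the orthonormal matrix is produced here by a QR factorization of random data.

```python
import numpy as np

def orthonormal_group_update(A, b, lam):
    """Minimize 0.5*||b - A a||^2 + lam*||a||_2, assuming A has orthonormal columns."""
    z = A.T @ b
    norm_z = np.linalg.norm(z)
    if norm_z <= lam:
        # Zero case: ||A^T b||_2 <= lam.
        return np.zeros_like(z)
    # Nonzero case: shrink A^T b toward zero by the factor (1 - lam / ||A^T b||_2).
    return (1.0 - lam / norm_z) * z

rng = np.random.default_rng(1)
A, _ = np.linalg.qr(rng.normal(size=(30, 4)))  # columns of A are orthonormal
b = rng.normal(size=30)
lam = 0.3 * np.linalg.norm(A.T @ b)            # below ||A^T b||_2, so the solution is nonzero

alpha = orthonormal_group_update(A, b, lam)
# Subgradient of Q at alpha (using A^T A = I); it should vanish at the optimum.
subgrad = -A.T @ b + alpha + lam * alpha / np.linalg.norm(alpha)
```

The vector `subgrad` is the quantity that the optimality condition requires to be zero, so its norm gives a direct check of the formula.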

For the general case, where covariate groups are not assumed to be orthonormalized, an
iterative procedure updating coefficients one group at a time is proposed by Meier et al.
(2008), and implemented in the R package grplasso for both the linear and logistic regression
settings. A rough sketch of their method is as follows. Each iteration of the algorithm cycles
through the groups, updating $\beta_k$ via a quasi-Newton method. Specifically, at group $k$, holding
the coefficients for all other groups fixed, the algorithm will seek to improve the estimate of $\beta_k$
as follows. Let $\beta_k^0$ denote the present value of the coefficient vector. First, the (unpenalized)
negative likelihood function is approximated, near $\beta_k^0$, via a quadratic function in $d = \beta_k - \beta_k^0$,
with quadratic term $c \cdot d^T d$ for some scalar $c > 0$. (Note that, in the linear regression setting,
the negative likelihood function is always a quadratic form with positive semidefinite leading
coefficient matrix; however, the coefficient matrix in the quadratic term might not be of the
form $c \cdot I_{p_k}$ for any $c$.) The objective function is then approximated by adding the penalty
term. Finally, the algorithm computes a minimizer $d$ of the approximated objective function,
and updates $\beta_k = \beta_k^0 + s \cdot d$, where the scalar $s$ is chosen via an (inexact) Armijo line search
(or updates $\beta_k = 0$ if appropriate).

This method's effectiveness lies in the efficiency of the updates, and in the fact that when 
the current estimate is quite close to the optimum, the update to each group of coefficients is 
a very good approximation of the true optimum for that group of coefficients when the other 
groups are fixed. The algorithm also requires very little precomputation. 

2.2. Global descent. Next, we examine existing methods which update the entire coefficient
vector simultaneously at each step. Kim et al. (2006) propose one such method, in which the
sum of the groups' 2-norms is bounded rather than penalized:

(4)    $\arg\min \left\{ \|y - X\beta\|_2^2 \ : \ \sum_{k=1}^{K} \|\beta_k\|_2 \le M \right\}$.

(This is equivalent to placing a penalty of $\lambda$ on $\sum_{k=1}^{K} \|\beta_k\|_2$ for some $\lambda > 0$ whose exact value
depends both on $M$ and on the data). In the terminology of Section 2.3 of Bertsekas (1999),
Kim et al.'s (2006) algorithm is a gradient projection method with constant stepsize. A brief
outline of the algorithm is as follows. Given an estimate of $\beta$, the algorithm first computes the
gradient of the (unpenalized) likelihood function, and takes a small step along that gradient.
Next, this resulting vector is projected to the closest vector satisfying the bound condition on
the sum of group norms. This process is then repeated until convergence.
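One gradient projection step of this kind can be sketched as follows. The projection routine below is an assumption on our part rather than the procedure of Kim et al. (2006): it uses the standard fact that projecting onto $\{\beta : \sum_k \|\beta_k\|_2 \le M\}$ amounts to soft-thresholding the vector of group norms so that they sum to $M$. The function names and the step size are hypothetical.

```python
import numpy as np

def project_group_norm_ball(beta_groups, M):
    """Euclidean projection onto {sum_k ||beta_k||_2 <= M}, for M > 0."""
    norms = np.array([np.linalg.norm(b) for b in beta_groups])
    if norms.sum() <= M:
        return [b.copy() for b in beta_groups]
    # Find theta with sum_k max(norms_k - theta, 0) = M via a sort-based search.
    sorted_norms = np.sort(norms)[::-1]
    candidates = (np.cumsum(sorted_norms) - M) / np.arange(1, len(norms) + 1)
    rho = np.nonzero(sorted_norms > candidates)[0][-1]
    theta = candidates[rho]
    shrunk = np.maximum(norms - theta, 0.0)
    # Rescale each group to its new norm (groups with zero norm stay at zero).
    scale = np.divide(shrunk, norms, out=np.zeros_like(norms), where=norms > 0)
    return [s * b for s, b in zip(scale, beta_groups)]

def gradient_projection_step(y, X_groups, beta_groups, M, step):
    """Move against the gradient of 0.5*||y - X beta||^2, then project onto the ball."""
    residual = y - sum(X @ b for X, b in zip(X_groups, beta_groups))
    moved = [b + step * (X.T @ residual) for X, b in zip(X_groups, beta_groups)]
    return project_group_norm_ball(moved, M)

rng = np.random.default_rng(2)
beta_groups = [rng.normal(size=3) for _ in range(4)]
M = 1.0
projected = project_group_norm_ball(beta_groups, M)
total_norm = sum(np.linalg.norm(b) for b in projected)
```

When the input lies outside the ball, the projected group norms sum to exactly $M$ (up to rounding), and a point already inside the ball is returned unchanged.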



Roth and Fischer (2008) propose a modification of Kim et al.'s (2006) algorithm, which
makes use of group-wise sparsity for faster convergence. At each iteration, the algorithm
has some hypothesized 'active set' of groups which are currently included in the model. The
coefficient vector is then optimized over that active set alone, using Kim et al.'s (2006)
optimization algorithm. Once convergence on the active set is reached, the solution is tested
for optimality; if it fails, then the active set is updated based on that information, and the
procedure is repeated. This algorithm may be particularly efficient when there is an optimal
solution involving only a very small fraction of the groups of covariates. In particular, their
experiments show improved time by several orders of magnitude in such scenarios.

Overall, a global search algorithm may be particularly efficient when there is high
correlation between the groups of coefficients, because group-wise descent may result in 'zig-zag'
paths to the optimum in this type of setting.

3. The SLS algorithm for the group lasso 
We first state a result which motivates our method. 
Theorem 2. Define

$Q(\alpha) = \frac{1}{2}\|b - A\alpha\|_2^2 + \lambda\|\alpha\|_2$,

where $b \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times q}$, $\lambda > 0$, and $\alpha$ may take any value in $\mathbb{R}^q$. Let $A^T A = U^T D U$ be the
spectral decomposition, with $D = \mathrm{diag}\{d_1, d_2, \dots, d_q\}$. Define $v = U A^T b$. Then:
i. If $\|v\|_2 \le \lambda$ then $\hat\alpha = 0$ is the unique minimizer of $Q$.
ii. If $\|v\|_2 > \lambda$, then there is a unique $r \in \mathbb{R}_+$ satisfying

$f(r) := \sum_{j=1}^{q} \frac{v_j^2}{(d_j r + \lambda)^2} = 1$.

Furthermore, if we define

$\alpha(r) = U^T (D + r^{-1}\lambda I_q)^{-1} v$,

then $\hat\alpha = \alpha(r)$ is the unique minimizer of $Q$.

We are now ready to define the Single Line Search (SLS) algorithm; see the pseudocode in
Algorithm 1. The intuition for the algorithm is simple. During each iteration, we cycle once
through the groups. At group $k$, we fix the coefficients outside of the group and compute the
partial residual $R_k = y - \sum_{i \neq k} X_i\beta_i$. We then apply Theorem 2 to find the exact optimal
value for $\beta_k$ given the fixed coefficients outside the group. (Specifically, in the notation
of Theorem 2, we may easily solve for $r$ using Newton's method, since $f(r)$ is a strictly
decreasing function with a derivative that is simple to compute). This strategy involves more
pre-computation than the existing algorithms, as it requires a spectral decomposition for any
group which is included in the model at any stage of the algorithm. In the scenarios we
consider in our simulations, however, this one-time computational cost is outweighed by the
efficiency of each group's update.
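The single-group update of Theorem 2 can be sketched as below. This is a minimal illustration rather than the authors' R implementation: the names `solve_radius` and `sls_group_update`, the tolerance, and the iteration cap are assumptions. Newton's method is started at $r = 0$, where $f(0) = \|v\|_2^2/\lambda^2 > 1$; since $f$ is convex and decreasing, the iterates increase toward the root.

```python
import numpy as np

def solve_radius(v, d, lam, tol=1e-12, max_iter=100):
    """Solve f(r) = sum_j v_j^2 / (d_j r + lam)^2 = 1 by Newton's method.

    Assumes ||v||_2 > lam, so that f(0) > 1 and a positive root exists.
    """
    r = 0.0
    for _ in range(max_iter):
        denom = d * r + lam
        f = np.sum(v ** 2 / denom ** 2)
        if abs(f - 1.0) < tol:
            break
        f_prime = -2.0 * np.sum(v ** 2 * d / denom ** 3)
        r -= (f - 1.0) / f_prime
    return r

def sls_group_update(A, b, lam):
    """Exact minimizer of 0.5*||b - A a||^2 + lam*||a||_2 via one line search in r."""
    d, V = np.linalg.eigh(A.T @ A)   # A^T A = V diag(d) V^T, so U = V^T
    d = np.clip(d, 0.0, None)        # guard against tiny negative eigenvalues
    v = V.T @ (A.T @ b)
    if np.linalg.norm(v) <= lam:
        return np.zeros(A.shape[1])
    r = solve_radius(v, d, lam)
    # alpha(r) = U^T (D + lam/r I)^{-1} v, written as r v_j / (d_j r + lam).
    return V @ (r * v / (d * r + lam))

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 4))
b = rng.normal(size=30)
lam = 0.5 * np.linalg.norm(A.T @ b)
alpha = sls_group_update(A, b, lam)
# Subgradient at the nonzero solution; it should be (numerically) zero.
subgrad = A.T @ (A @ alpha - b) + lam * alpha / np.linalg.norm(alpha)
```

Writing $\alpha(r)$ with $r$ in the numerator avoids dividing by $r$, which is a small numerical convenience chosen for this sketch.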

An immediate corollary to Theorem 2 is the following:

Corollary 1. Let $\beta^{(t)}$ be the coefficient estimate after $t$ iterations of the SLS algorithm for
$t = 0, 1, 2, \dots$. Then for all $t$,

$L(\beta^{(t+1)}) \le L(\beta^{(t)})$.

That is, each iteration of the algorithm does not increase the objective function.

Next we state a convergence result for this algorithm, which follows directly from
Proposition 5.1 of Tseng (2001). Note that no conditions are necessary on the data $(X, y)$ or the
(positive) penalty $\lambda$.

Theorem 3. Let $\beta^{(t)}$ be the coefficient vector after the $t$-th iteration of Algorithm 1. Then
$L(\beta^{(t)}) \to \min_\beta L(\beta)$.

Finally, since in practice we will wish to terminate the algorithm after a finite number of
steps, the following theorem gives a guarantee of accuracy. When terminating the algorithm
after $t$ iterations, we can apply the theorem below (with $\beta^* = \beta^{(t)}$) to bound the error in the
current estimate of the optimal fitted values $\hat{y}$.

Theorem 4. Take any $\beta^* \in \mathbb{R}^p$, and any $w^* \in \partial L(\beta^*)$. Let $\hat{y}$ be the unique optimal vector
of fitted values. Then

$\|X\beta^* - \hat{y}\|_2^2 \le 2(w^*)^T\beta^* + O(\|w^*\|_2)$,

with precise bounds given in the Appendix.



4. Simulations for the group lasso 
We compare the speed of the SLS algorithm to the three existing methods described
above: Meier et al.'s (2008) group-wise search algorithm, Kim et al.'s (2006) bounded
global-update algorithm, and Roth and Fischer's (2008) active-set modification of Kim et al.'s (2006)
method.

Our simulations vary along three different parameters: the total number of groups, $K$;
the level of within-group correlation, $a$; and the level of between-group similarity, $b$ (each
described in detail below). For each parametrization, we run 100 trials. In each trial, we
generate covariates $X$ and response $y$, and also a decreasing sequence of 5 penalty parameters
$\{\lambda^1, \dots, \lambda^5\}$. We run each of the four algorithms on the sequence of group lasso problems defined
by the data $(X, y)$ and the penalty parameter sequence $\{\lambda^i\}$. For each parametrization, after
running 100 trials, we record the average time used by each algorithm using the proc.time()
function in the software R (R Development Core Team, 2010).

Algorithm 1 Single Line Search (SLS) algorithm for the group lasso
Input: $y \in \mathbb{R}^n$, $X_1 \in \mathbb{R}^{n \times p_1}, \dots, X_K \in \mathbb{R}^{n \times p_K}$, $\lambda > 0$.
Output: $\hat\beta \in \mathbb{R}^p$ minimizing $L(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\sum_{k=1}^{K}\|\beta_k\|_2$, where $p = p_1 + \dots + p_K$
    and $X = (X_1 \ \dots \ X_K)$.
Initialize: $\beta \Leftarrow 0$.
repeat
    for $k = 1, 2, \dots, K$ do
        if $\|X_k^T R_k\|_2 \le \lambda$ then
            $\beta_k \Leftarrow 0$.
        else
            Compute the spectral decomposition $X_k^T X_k = U_k^T D_k U_k$ if not previously computed,
            and write $D_k = \mathrm{diag}\{d_1, \dots, d_{p_k}\}$.
            $v_k \Leftarrow U_k X_k^T R_k$.
            Find the unique $r > 0$ satisfying $f(r) = \sum_{j=1}^{p_k} \frac{(v_k)_j^2}{(d_j r + \lambda)^2} = 1$.
            $\beta_k \Leftarrow U_k^T (D_k + r^{-1}\lambda I_{p_k})^{-1} v_k$.
        end if
    end for
until some convergence criterion is met.
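A compact Python rendering of the loop in Algorithm 1 is sketched below. It is not the R code used for the timings reported here; the function name `sls_group_lasso`, the zero initialization, the sweep limit, and the stopping rule (largest coefficient change in a sweep below `tol`) are assumptions made for illustration.

```python
import numpy as np

def sls_group_lasso(y, X_groups, lam, max_sweeps=500, tol=1e-10):
    """Cyclic exact group updates for 0.5*||y - sum_k X_k b_k||^2 + lam*sum_k ||b_k||_2."""
    betas = [np.zeros(X.shape[1]) for X in X_groups]
    eig_cache = {}           # spectral decompositions, computed once per group
    residual = y.copy()      # equals y - X beta for the starting value beta = 0
    for sweep in range(max_sweeps):
        max_change = 0.0
        for k, X in enumerate(X_groups):
            partial = residual + X @ betas[k]   # partial residual R_k
            g = X.T @ partial
            if np.linalg.norm(g) <= lam:
                new = np.zeros_like(betas[k])
            else:
                if k not in eig_cache:
                    d, V = np.linalg.eigh(X.T @ X)
                    eig_cache[k] = (np.clip(d, 0.0, None), V)
                d, V = eig_cache[k]
                v = V.T @ g
                r = 0.0
                for _ in range(100):            # Newton's method on f(r) - 1
                    denom = d * r + lam
                    f = np.sum(v ** 2 / denom ** 2)
                    if abs(f - 1.0) < 1e-13:
                        break
                    r += (f - 1.0) / (2.0 * np.sum(v ** 2 * d / denom ** 3))
                new = V @ (r * v / (d * r + lam))
            max_change = max(max_change, np.max(np.abs(new - betas[k])))
            residual = partial - X @ new
            betas[k] = new
        if max_change < tol:                    # assumed stopping rule
            break
    return betas

rng = np.random.default_rng(0)
n = 50
X_groups = [rng.normal(size=(n, 3)) for _ in range(4)]
y = X_groups[0] @ np.ones(3) + 0.1 * rng.normal(size=n)
lam = 0.5 * max(np.linalg.norm(X.T @ y) for X in X_groups)
betas = sls_group_lasso(y, X_groups, lam)
```

The residual is updated incrementally so that each group update costs one multiplication by $X_k$ and one by $X_k^T$, and the eigendecomposition of $X_k^T X_k$ is computed only for groups that become nonzero at some point.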



4.1. Implementation of the algorithms. Existing code for the various methods is implemented
across different environments (such as C and R). For a fair comparison, therefore, we
re-coded the methods in R (R Development Core Team, 2010) using the pseudo-code given
in the papers, and implemented the SLS algorithm in R as well.

Code for the method in Meier et al. (2008) is available via the grplasso package in R.
For an unbiased comparison with the SLS algorithm, we took our existing code for SLS, and
replaced the SLS group update step with the part of their code pertaining to the group
update step. (Except for this update step, the structure of the two algorithms is identical,
since both are group-wise descent algorithms).

We coded the algorithms to run on decreasing sequences of penalty values $\{\lambda^i\}$ for the
SLS and the Meier et al. (2008) algorithms, or increasing sequences of bound values $\{M^i\}$ for
the Kim et al. (2006) and Roth and Fischer (2008) algorithms. In each algorithm, convergence
at each penalty or bound value is determined by the stopping criterion



< 10" 



4.2. Simulated data. We generate the data as follows. In each simulation, we have $n = 50$
samples. The number of groups, $K$, ranges in the set $\{10, 20, 40, 80\}$, but the number of groups
in the true smallest model is always 2. Each group has 10 covariates. The true coefficient
vector $\beta_0$ is given by

$(\beta_0)_k = \begin{cases} 1_{10}, & k = 1, 2, \\ 0_{10}, & k > 2, \end{cases}$

where $1_m$ and $0_m$ are the vectors in $\mathbb{R}^m$ with all entries equal to 1 or 0, respectively.

Each row of $X$ is sampled independently from a $N(0_p, \Sigma)$ distribution, where $\Sigma$ is
determined by two parameters: within-group correlation, $a$, and between-group similarity, $b$, as
follows. For $a, b \in [0, 1]$, we define $\Sigma$ group-wise as $\Sigma = (\Sigma_{k_1,k_2})_{1 \le k_1, k_2 \le K}$, with:

$\Sigma_{k_1,k_2} = \begin{pmatrix} 1 & a & \dots & a \\ a & 1 & \dots & a \\ \vdots & & \ddots & \vdots \\ a & a & \dots & 1 \end{pmatrix} \times \begin{cases} 1, & k_1 = k_2, \\ b, & k_1 \neq k_2. \end{cases}$

In our experiments, we simulate nine different correlation structures, by pairing
$a \in \{.2, .5, .8\}$ (low, medium, or high within-group correlation) with $b \in \{.2, .5, .8\}$ (low,
medium, or high between-group similarity). We then generate $y \sim N(X\beta_0, \sigma^2 I_n)$, with $\sigma^2$
defined as

$\sigma^2 = 0.01\,\beta_0^T \Sigma \beta_0 = 0.01 \cdot \mathrm{Var}(x^T\beta_0)$,

where $x \sim N(0, \Sigma)$ represents a single draw of a row of $X$, in order to produce a high
signal-to-noise ratio.
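The data-generating mechanism can be sketched as follows. The block structure of $\Sigma$ used here is an assumption about the intended design: each diagonal block has ones on the diagonal and $a$ off the diagonal, and each off-diagonal block equals $b$ times that same matrix, so that $\Sigma$ is a Kronecker product. The function name `simulate_data` and the seed are hypothetical.

```python
import numpy as np

def simulate_data(K, a, b, n=50, group_size=10, seed=0):
    """Draw (X, y) with K groups; only the first two groups carry signal."""
    rng = np.random.default_rng(seed)
    within = (1 - a) * np.eye(group_size) + a * np.ones((group_size, group_size))
    between = (1 - b) * np.eye(K) + b * np.ones((K, K))
    # Assumed structure: block (k1, k2) of Sigma is `within` times 1 (k1 == k2) or b (k1 != k2).
    Sigma = np.kron(between, within)
    p = K * group_size
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta0 = np.zeros(p)
    beta0[: 2 * group_size] = 1.0               # groups 1 and 2 are all ones
    sigma2 = 0.01 * beta0 @ Sigma @ beta0       # noise variance: 1% of Var(x^T beta0)
    y = X @ beta0 + rng.normal(scale=np.sqrt(sigma2), size=n)
    return X, y, beta0, Sigma

X, y, beta0, Sigma = simulate_data(K=10, a=0.5, b=0.2)
```

With $a, b < 1$ both Kronecker factors are positive definite, so $\Sigma$ is a valid covariance matrix under this assumed structure.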

4.3. Penalties and bounds. In practice, it is often useful to compute a 'solution path' over
a set of values of the penalty parameter $\lambda$. In fact, Roth and Fischer (2008) observe that, for
their method, given a bound $M$, it is actually often faster to compute a solution path along
an increasing sequence of bounds $M^1 < M^2 < \dots < M^N = M$, than to directly compute the
solution for the bound $M$. This sequence of increasing bounds corresponds to a decreasing
sequence of penalty parameters, $\lambda^1 > \lambda^2 > \dots > \lambda^N$. Meier et al.'s grplasso package is
also implemented to find solutions along such a solution path, meaning that the solutions
for penalty values $\{\lambda^i\}$ are computed sequentially, using the final solution $\hat\beta^{\lambda^i}$ for penalty
parameter $\lambda^i$ as an initial estimate for computing the solution $\hat\beta^{\lambda^{i+1}}$ (and using $\beta = 0$ as an initial
estimate for computing $\hat\beta^{\lambda^1}$). We use this sequential structure (with decreasing penalties $\{\lambda^i\}$
or increasing bounds $\{M^i\}$, as appropriate) in our implementation of each method.

In each simulation, to choose a sequence of penalty parameters, we first compute $\lambda_{\max}$,
defined as

$\lambda_{\max} = \sup\{\lambda > 0 : \hat\beta^{\lambda} \neq 0\} = \max_k \|X_k^T y\|_2$,

where $\hat\beta^{\lambda}$ is any solution to the group lasso problem with penalty parameter $\lambda$. (The last
equality follows from the subdifferential condition). We then choose the sequence
$\{\lambda^i = \lambda_{\max} \times 2^{-i}\}_{1 \le i \le 5}$.
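A short sketch of this penalty sequence, reading it as $\lambda^i = \lambda_{\max} \times 2^{-i}$ for $i = 1, \dots, 5$ with $\lambda_{\max} = \max_k \|X_k^T y\|_2$; the function name `penalty_sequence` and the random data are placeholders.

```python
import numpy as np

def penalty_sequence(y, X_groups, length=5):
    """Return lam_max = max_k ||X_k^T y||_2 and the halving sequence lam_max * 2^{-i}."""
    lam_max = max(np.linalg.norm(X.T @ y) for X in X_groups)
    return lam_max, [lam_max * 2.0 ** (-i) for i in range(1, length + 1)]

rng = np.random.default_rng(0)
X_groups = [rng.normal(size=(50, 10)) for _ in range(4)]
y = rng.normal(size=50)
lam_max, lams = penalty_sequence(y, X_groups)
```

At any penalty of at least `lam_max` every group satisfies the zero condition $\|X_k^T y\|_2 \le \lambda$, so the zero vector is optimal; each value in `lams` lies below that threshold.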

For each choice of $(X, y)$ and for a value $\lambda^i$ of the penalty parameter, we find the bound $M^i$
that produces the same solution set by computing $\hat\beta^{\lambda^i}$ (via the SLS algorithm, for example)
and defining $M^i = \sum_{k=1}^{K} \|\hat\beta^{\lambda^i}_k\|_2$; the solution of (4) with $M = M^i$ is then equal to the solution
of (1) with $\lambda = \lambda^i$. This allows us to compare the Kim et al. (2006) and Roth and Fischer
(2008) algorithms, which use bounds on the sum of group norms, to the SLS and Meier et al.
(2008) algorithms, which use penalties on the group norms.



4.4. Results. Results for our simulations are displayed in Figure 1 (note that the time axis
is drawn on a log scale). Under any choice of parameters $a$, $b$, and $K$, the SLS algorithm
converges faster than any of the other three methods considered, with one exception ($a = .2$,
$b = .8$, $K = 10$) when it is slightly outperformed by Meier et al.'s (2008) algorithm. Considering
the results as a whole, the most comparable method to the SLS, in terms of performance,
is Meier et al.'s (2008) algorithm, which performs almost as fast as the SLS algorithm in
some simulations. The SLS algorithm's improvement in speed relative to Meier et al.'s (2008)
algorithm is strongest for higher values of $a$ and lower values of $b$. This is intuitive, since
a high value of within-group correlation $a$ means that gradient approximations to the group
optimization will tend to not be very accurate, and therefore the optimization step of the SLS
algorithm is likely to improve time considerably. On the other hand, a high value of
between-group similarity $b$ means that many groups will be included at some stage of the algorithm,
and so the SLS algorithm will have many spectral decompositions to perform. The remaining
two methods are consistently slower than the SLS algorithm in the settings simulated here,
and, depending on the setting, slower or comparable to the Meier et al. (2008) algorithm.

Figure 1. Time until convergence in the group lasso with 10, 20, 40, or 80
total groups ($K$), for the SLS algorithm, the Meier et al. (2008) algorithm,
the Kim et al. (2006) algorithm, and the Roth and Fischer (2008) algorithm.
Parameters: $a$ = within-group correlation, $b$ = between-group similarity. The
vertical (time) axis is drawn on a log scale.

Overall, the efficient structure of the SLS algorithm is clearly evident in its faster
computation time relative to the other algorithms, when the number of groups is moderate as in
these simulations. (Adapting the algorithm to be efficient in very high-dimensional settings
is discussed in Section 6).

5. The SSLS algorithm for the sparse group lasso 



In recent work, Wu and Lange (2008) and Friedman et al. (2010) discuss the question of
within-group sparsity. The original group lasso has the property that, in general settings,
with probability 1, each group of coefficients will be either entirely zero or entirely nonzero in
the optimal solution. While this is natural in some settings, there are many settings in which
allowing for within-group sparsity would be more plausible, and may help to recover a signal
more accurately. The proposed penalized likelihood function (Friedman et al., 2010; Wu and
Lange, 2008) is given by

$L_1(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 + \lambda_1\sum_{k=1}^{K}\|\beta_k\|_2 + \lambda_2\|\beta\|_1$.

Friedman et al. (2010) show that the subdifferential of $L_1$ at $\beta$ is separable over the groups,
and that the subdifferential of $L_1$ with respect to the $k$-th group is given by

$\partial_{\beta_k} L_1(\beta) = -X_k^T(y - X\beta) + \lambda_1 s_k + \lambda_2 t_k$,

where $s_k = \beta_k/\|\beta_k\|_2$ whenever $\beta_k \neq 0$ and may be any vector of up to unit norm if $\beta_k = 0$,
and $(t_k)_j = \mathrm{sign}((\beta_k)_j)$ whenever $(\beta_k)_j \neq 0$ and may be any number in $[-1, 1]$ if $(\beta_k)_j = 0$.
A coefficient vector is therefore a minimizer of $L_1$ if and only if each group's subdifferential
contains the zero vector:

$\hat\beta \in \arg\min_\beta L_1(\beta) \iff 0 \in \partial_{\beta_k} L_1(\hat\beta) \ \ \forall k$.

We now describe an adaptation of the SLS algorithm, which can solve the sparse group lasso
problem effectively for small group sizes. We first explain the intuition behind the algorithm.
When updating a single group, the relevant subproblem consists of minimizing

(5)    $Q_1(\alpha) = \frac{1}{2}\|b - A\alpha\|_2^2 + \lambda_1\|\alpha\|_2 + \lambda_2\|\alpha\|_1$,

where $b \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times q}$, $\lambda_1, \lambda_2 > 0$, and $\alpha$ may take any value in $\mathbb{R}^q$. We denote the optimum
by $\hat\alpha$.

As observed in Friedman et al. (2010),

$\hat\alpha = 0 \iff \|\{A^T b\}_{\lambda_2}\|_2 \le \lambda_1$,

where $\{\cdot\}_{\lambda_2}$ denotes the soft thresholding operation, defined for a real value $x \in \mathbb{R}$ by
$\{x\}_{\lambda_2} = \mathrm{sign}(x) \cdot \max\{|x| - \lambda_2, 0\}$, and defined on a vector by applying the operation element-wise.
When $\alpha = 0$ is not optimal, the subgradient condition for optimality is therefore given by

$0 = -A^T b + A^T A\alpha + \lambda_1\frac{\alpha}{\|\alpha\|_2} + \lambda_2 t$,

where $t_j = \mathrm{sign}(\alpha_j)$ if $\alpha_j \neq 0$, and may be any number in $[-1, 1]$ if $\alpha_j = 0$. Next, observe that,
if $\mathrm{sign}(\hat\alpha)$ is known, then we may solve for $\hat\alpha$ via the same strategy as in the SLS algorithm.
Specifically, if $\mathrm{sign}(\hat\alpha) = s$ for a known $s \in \{-1, 0, 1\}^q$, then defining $J = \{j : s_j \neq 0\}$, we
know that $\hat\alpha_{J^c} = 0$, and that $\hat\alpha_J$ must satisfy

$0 = -A_J^T b + A_J^T A_J\alpha_J + \lambda_1\frac{\alpha_J}{\|\alpha_J\|_2} + \lambda_2 s_J$,

where $A_J$ is the $n \times |J|$ matrix consisting of the columns of $A$ with indices in $J$. Since $s$ is
assumed to be known, we may apply Theorem 2 to solve for $\alpha_J$.

In practice, the optimal sign vector $s$ is not known. However, we may cycle through all sign
vectors $s \in \{-1, 0, 1\}^q$, attempt to solve for $\hat\alpha$ under each choice for $s$, and check for optimality.
This intuition is formalized in Theorem 5 below.



Theorem 5. Define

$Q_1(\alpha) = \frac{1}{2}\|b - A\alpha\|_2^2 + \lambda_1\|\alpha\|_2 + \lambda_2\|\alpha\|_1$,

where $b \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times q}$, $\lambda_1, \lambda_2 > 0$, and $\alpha$ may take any value in $\mathbb{R}^q$. Then:

i. Suppose $\|\{A^T b\}_{\lambda_2}\|_2 \le \lambda_1$. Then $\hat\alpha = 0$ is the unique minimizer of $Q_1$.

ii. Suppose $\|\{A^T b\}_{\lambda_2}\|_2 > \lambda_1$. For any vector of signs $s \in \{-1, 0, 1\}^q$, write
$J = \{j : s_j \neq 0\}$, and let $A_J^T A_J = U_J^T D_J U_J$ be the spectral decomposition, with
$D_J = \mathrm{diag}\{d_1^J, \dots, d_{|J|}^J\}$. Define also $v_s = U_J(A_J^T b - \lambda_2 s_J)$ and

$f_s(r) = \sum_{j=1}^{|J|} \frac{(v_s)_j^2}{(d_j^J r + \lambda_1)^2}$.

Define $\hat{s} \in \{-1, 0, 1\}^q$ to be the (unknown) vector of signs of the true optimal solution
$\hat\alpha = \arg\min_\alpha Q_1(\alpha)$. Let $s$ be any vector of signs in $\{-1, 0, 1\}^q$. Then:

a. If $s = \hat{s}$, there will be exactly one $r$ satisfying $f_s(r) = 1$. Furthermore, if we define

$\alpha_J(r) = U_J^T(D_J + r^{-1}\lambda_1 I_{|J|})^{-1} v_s$ and $\alpha(r)_j = \begin{cases} (\alpha_J(r))_j, & j \in J, \\ 0, & j \notin J, \end{cases}$

then the following feasibility conditions will be satisfied:

$\mathrm{sign}(\alpha(r)) = s$ and $\forall j \notin J, \ |A_j^T(b - A\alpha(r))| \le \lambda_2$.

Moreover, $\hat\alpha = \alpha(r)$ is the unique minimizer of $Q_1$.

b. If instead $s \neq \hat{s}$, then either $f_s(r) = 1$ will have no solutions, or it will have one
solution $r$ with $\alpha(r)$ failing the feasibility conditions.

We are now ready to define the 'Signed Single Line Search' algorithm for the sparse group
lasso; see the pseudocode in Algorithm 2. We note that, at the step updating group $k$,
this algorithm could potentially cycle through as many as $3^{p_k}$ sign vectors before finding the
optimal group coefficient vector. Therefore we might expect this algorithm to be, at worst, up
to $3^{\max_k\{p_k\}}$ times slower than the SLS algorithm. However, cycling through the possible
sign vectors may be done in an order that is better than random, lowering the expected
number of sign vectors that need to be tested at each step.
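
The remark about ordering can be made concrete. The following Python sketch shows one
hypothetical heuristic, which is not a rule given in the text: it enumerates all of
$\{-1,0,1\}^{p_k}$ but tries first the sign vectors closest (in Hamming distance) to a reference
sign vector, for example the signs of the current estimate of $\beta_k$. The function name
`ordered_sign_vectors` and the use of Hamming distance are assumptions of this sketch.

```python
from itertools import product

def ordered_sign_vectors(reference_signs):
    """All vectors in {-1, 0, 1}^p, sorted by Hamming distance to reference_signs."""
    p = len(reference_signs)

    def distance(s):
        # Number of coordinates in which s differs from the reference sign vector.
        return sum(a != b for a, b in zip(s, reference_signs))

    # sorted() is stable, so ties keep the enumeration order of product().
    return sorted(product((-1, 0, 1), repeat=p), key=distance)

# Hypothetical reference: signs of the current estimate for a group of size 3.
candidates = ordered_sign_vectors((1, 0, -1))
```

For a group of size 3 the list has $3^3 = 27$ entries, and the reference vector itself is
tested first, so an update whose sign pattern has not changed needs only one line search.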

By Theorem 5, we know that at each step, the algorithm finds the optimal value for $\beta_k$
(conditional on the other groups' coefficient estimates at that time). We therefore have the
immediate corollary:

Corollary 2. Let $\beta^{(t)}$ be the coefficient estimate after $t$ iterations of the SSLS algorithm for
$t = 0, 1, 2, \dots$. Then for all $t$,

$$L_1(\beta^{(t+1)}) \le L_1(\beta^{(t)}) .$$

That is, each iteration of the algorithm does not increase the objective function.

Finally, as with the SLS algorithm, we state convergence and accuracy results. The
convergence result again follows directly from Proposition 5.1 of Tseng (2001). The proof of the
accuracy theorem is very similar to that of Theorem 4 and we omit it in this paper.

Algorithm 2 Signed Single Line Search (SSLS) algorithm for the sparse group lasso

Input: $y \in \mathbb{R}^n$, $X_1 \in \mathbb{R}^{n \times p_1}, \dots, X_K \in \mathbb{R}^{n \times p_K}$, $\lambda_1, \lambda_2 > 0$.
Output: $\hat\beta$ minimizing $L_1(\beta) = \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda_1 \sum_{k=1}^K \|\beta_k\|_2 + \lambda_2\|\beta\|_1$.
Initialize: $\beta \leftarrow \beta^{(0)}$.
repeat
    for $k = 1, 2, \dots, K$ do
        if $\|\{X_k^T R_k\}_{\lambda_2}\|_2 \le \lambda_1$ then
            $\beta_k \leftarrow 0$.
        else
            repeat
                Choose some sign vector $s \in \{-1,0,1\}^{p_k}$.
                Solve the optimization problem associated with $s$ as in Procedure 3.
            until a feasible solution has been found
        end if
    end for
until some convergence criterion is met.
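
A minimal sketch of a single SSLS group update, written in Python under a simplifying
assumption that is not part of Algorithm 2: the columns of the group's design matrix $A$ are
mutually orthogonal, so that $A_J^T A_J$ is diagonal and $U_J$ is the identity for every $J$.
The inputs `d` (squared column norms) and `c` (the vector $A^T b$), the bisection search, and
all function names are choices of this sketch. A candidate is accepted when its signs match
$s$ and, for $j \notin J$, $A_j^T(b - A\alpha) \in \lambda_2 \times [-1, 1]$, which is the form in
which the excluded coordinates enter the subgradient condition (13) in the proof of Theorem 5.

```python
import math
from itertools import product

def line_search(d, v, lam1):
    """Return r > 0 with sum_j v_j^2 / (d_j r + lam1)^2 = 1, or None if there is none."""
    def f(r):
        return sum(vj * vj / (dj * r + lam1) ** 2 for dj, vj in zip(d, v))
    if f(0.0) <= 1.0:                 # f is strictly decreasing, so no positive root
        return None
    lo, hi = 0.0, 1.0
    while f(hi) > 1.0:                # bracket the root (assumes every d_j > 0)
        hi *= 2.0
    for _ in range(200):              # bisection on the bracket [lo, hi]
        mid = 0.5 * (lo + hi)
        if f(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ssls_group_update(d, c, lam1, lam2, tol=1e-9):
    """One group update, assuming A^T A = diag(d) and c = A^T b (so U_J = I)."""
    q = len(c)
    soft = [math.copysign(max(abs(cj) - lam2, 0.0), cj) for cj in c]
    if math.sqrt(sum(x * x for x in soft)) <= lam1:
        return [0.0] * q                          # case i of Theorem 5
    for s in product((-1, 0, 1), repeat=q):       # exhaustive search over sign vectors
        J = [j for j in range(q) if s[j] != 0]
        v = [c[j] - lam2 * s[j] for j in J]       # v^s in the eigenbasis (U_J = I here)
        r = line_search([d[j] for j in J], v, lam1)
        if r is None:
            continue                              # f_s(r) = 1 has no solution
        alpha = [0.0] * q
        for j, vj in zip(J, v):
            alpha[j] = vj / (d[j] + lam1 / r)
        signs_ok = all((alpha[j] > 0) - (alpha[j] < 0) == s[j] for j in range(q))
        # With orthogonal columns, A_j^T (b - A alpha) = c_j whenever alpha_j = 0.
        zeros_ok = all(abs(c[j]) <= lam2 + tol for j in range(q) if s[j] == 0)
        if signs_ok and zeros_ok:
            return alpha
    raise RuntimeError("no feasible sign vector found")

# Orthonormal example: three columns, A^T b = (3, -2, 0.5), lam1 = lam2 = 1.
alpha = ssls_group_update([1.0, 1.0, 1.0], [3.0, -2.0, 0.5], 1.0, 1.0)
```

In this orthonormal example the result can be checked by hand: soft-thresholding $A^T b$ at
$\lambda_2$ gives $(2, -1, 0)$, and the group-level shrinkage multiplies it by $1 - 1/\sqrt{5}$.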



Theorem 6. Let $\beta^{(t)}$ be the coefficient vector after the $t$th iteration of Algorithm 2. Then
$L_1(\beta^{(t)}) \to \min_\beta L_1(\beta)$.

Theorem 7. Take any $\beta^* \in \mathbb{R}^p$, and any $w^* \in \partial L_1(\beta^*)$. Let $\hat y$ be the unique optimal vector
of fitted values. Then

$$\|X\beta^* - \hat y\|_2^2 \le 2(w^*)^T\beta^* + O(\|w^*\|_2) .$$

Lacking competing methods to compare to, we do not report on numerical experiments
with the SSLS algorithm. However, we remark that an R implementation solved sparse group
lasso problems with 40 groups of size 5 (and with a single choice of penalty parameters $(\lambda_1, \lambda_2)$
which produced appropriate sparsity patterns) in a few seconds.

6. Discussion 

For the group lasso in the linear regression setting, the SLS algorithm offers a fast and exact
group-wise update step, which, in our simulations, performs very well on moderately sized
problems as compared to existing methods. One immediate extension of the SLS algorithm
would be to make use of the 'active set' construction developed by Roth and Fischer (2008), the
framework of which may be combined with any algorithm that solves the group lasso problem.
Their work shows that adding this 'active set' construction to Kim et al.'s (2006) global descent
algorithm may speed up computation considerably. Combining the SLS algorithm with Roth
and Fischer's (2008) 'active set' construction is therefore likely to improve computation speed
on very large (and very sparse) group lasso problems. Furthermore, while this paper's focus
is on linear regression, our methods may be extended to other likelihood functions via
second-order approximations, as in the work on the logistic case in Meier et al. (2008). However, since
any other likelihood function will not be exactly quadratic (as in the case of linear regression),
our method will not be able to solve directly for each group's optimum value (fixing the other
groups' coefficients), and so it is not clear whether an improvement in speed can be expected
in non-linear regression.

Procedure 3 SSLS subroutine

if $f_s(r) = 1$ has a solution $r$ then
    Compute $\alpha = \alpha_J(r)$ as in Theorem 5 (with $A = X_k$ and $b = R_k$).
    if $\mathrm{sign}(\alpha) = s_J$, and for all $j \notin J$, $\left|\{(X_k)_j^T(R_k - (X_k)_J\alpha)\}_{\lambda_2}\right| \le \lambda_1$ then
        $(\beta_k)_j \leftarrow \alpha_j$ for all $j \in J$.
        $(\beta_k)_j \leftarrow 0$ for all $j \notin J$.
        This solution is feasible.
    else
        No feasible solution exists.
    end if
else
    No feasible solution exists.
end if

In the case of the sparse group lasso, there are many possibilities for developing a more
efficient algorithm based on the same principles as the SLS algorithm for group lasso. The
strategy of exhaustive search through sign configurations is, of course, impractical for even
a moderately large group size. One alternative approach is to reduce to single-coordinate
descent rather than group-wise descent (in order to avoid the issue of sign configurations).
However, this is potentially problematic, because any coordinate descent approach to either
a group lasso or sparse group lasso problem has the drawback of occasionally converging to a
non-optimal solution. Specifically, the coefficients within some group $k$ may become 'trapped'
at zero, even when $\beta_k = 0$ is not optimal, due to the structure of the 2-norm penalty. We
illustrate this with an example; the example is phrased in the group lasso setting, but may
easily be adapted to show that the same problem may occur in the sparse group lasso setting.

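Example 1. (The 'zero trap'). Consider a group lasso problem with a single group
consisting of two covariates. Define the data and the penalty parameter so that the objective
function in (1) becomes

$$L(\beta) = \tfrac{1}{2}(1-\beta_1)^2 + \tfrac{1}{2}(1-\beta_2)^2 + \|\beta\|_2 .$$

If we fix $\beta_2 = 0$ and optimize over $\beta_1$, we obtain $\beta_1 = 0$. If we then update $\beta_2$, we obtain
$\beta_2 = 0$. Therefore, coordinate descent with a starting value of $\beta_1 = \beta_2 = 0$ will never leave
this value. However, the value of $\beta$ which minimizes $L$ is given by $\beta_1 = \beta_2 = 1 - \frac{1}{\sqrt{2}}$. ■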

We conclude that, in any situation where some groups might have a signal which is weak on
any individual coefficient but significant in total, coordinate descent methods of optimization
may not be reliable, and thus extra care is required if they are used to circumvent the problem
of sign configurations.
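
The 'zero trap' of Example 1 can be reproduced numerically. The Python sketch below takes the
objective $L(\beta) = \tfrac{1}{2}(1-\beta_1)^2 + \tfrac{1}{2}(1-\beta_2)^2 + \|\beta\|_2$ of the example, runs
coordinate-wise updates from the origin, and compares the result with a single group-wise
update. The helper names are hypothetical, and the closed-form group update used here (shrinking
the vector $(1,1)$ as a whole) is specific to this objective, whose quadratic part has an
identity design.

```python
import math

def objective(b1, b2):
    """Objective L of Example 1."""
    return 0.5 * (1 - b1) ** 2 + 0.5 * (1 - b2) ** 2 + math.hypot(b1, b2)

def soft_threshold(z, t):
    """Minimizer over b of 0.5 * (z - b)^2 + t * |b|."""
    return math.copysign(max(abs(z) - t, 0.0), z)

# Coordinate descent from (0, 0): while the other coordinate is 0 the penalty
# reduces to |b|, so each one-dimensional update soft-thresholds 1 at level 1.
b1 = b2 = 0.0
for _ in range(10):
    b1 = soft_threshold(1.0, 1.0)   # valid because b2 == 0
    b2 = soft_threshold(1.0, 1.0)   # valid because b1 == 0
stuck = (b1, b2)

# Group-wise update: shrink the whole vector (1, 1) by the factor 1 - 1/||(1, 1)||_2.
shrink = max(1.0 - 1.0 / math.hypot(1.0, 1.0), 0.0)
group = (shrink, shrink)
```

Here `stuck` remains at the origin, where the objective equals 1, whereas `group` has both
entries equal to $1 - 1/\sqrt{2}$ and attains a strictly smaller objective value.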

(Footnote to Example 1: With these data and parameter values, the objective function in (1)
becomes $L(\beta) = \tfrac{1}{2}(1-\beta_1)^2 + \tfrac{1}{2}(1-\beta_2)^2 + \|\beta\|_2$.)

In the SSLS algorithm, any given update of the $k$th group may need to test up to $3^{p_k}$
sign configurations. However, when the algorithm has neared the true solution, we might
expect 'sign stabilization'; that is, the optimal sign vector at iteration $t$ may be unchanged
at iteration $(t + 1)$. This suggests that attempting a signed single-line-search update for each
group may be very efficient after a certain point. For early iterations, when many groups are 
not yet 'sign stabilized', other methods (such as gradient-based methods) could be considered. 
The potential efficiency of this kind of algorithm lies in the fact that whenever a group k has 
achieved sign stabilization, the algorithm could optimize the entire group of coefficients at 
once rather than pursuing any less efficient update strategy. We plan to develop this strategy 
in future work in order to create an algorithm that can solve the sparse group lasso problem 
with moderate or large group size. 



7. Appendix 

In this section we prove all theoretical results stated in the paper. In order for this appendix 
to be self-contained we restate some of the theorems. 

Theorem 1. Let $B$ be the set of minimizers of the penalized likelihood $L$ for a group lasso
problem (1) or sparse group lasso problem. Then there exists a unique minimal set of
groups $\mathcal{K} \subset \{1, \dots, K\}$, and unique unit vectors $v_k \in \mathbb{R}^{p_k}$ for each $k \in \mathcal{K}$, such that

$$\beta \in B \;\Rightarrow\; \beta_k \propto v_k \ \ \forall k \in \mathcal{K}, \quad \beta_k = 0 \ \ \forall k \notin \mathcal{K} ,$$

where we define $a \propto b$ to mean that $a = c \cdot b$ for some nonnegative scalar $c$.

Proof. (This proof addresses the sparse group lasso case; the group lasso case can be obtained
by setting $\lambda_1 = \lambda$ and $\lambda_2 = 0$ in the sparse group lasso problem.)

We first make an observation about the convexity of $L(\beta)$. Take any $\beta^1, \beta^2 \in \mathbb{R}^p$. If
$\beta^1_k, \beta^2_k \ne 0$, and there does not exist a positive $c$ with $\beta^1_k = c\beta^2_k$, then
$\|\beta^1_k + \beta^2_k\|_2 < \|\beta^1_k\|_2 + \|\beta^2_k\|_2$. Therefore, since all other terms in $L$ are convex in $\beta$, we know that
$L(\tfrac{1}{2}(\beta^1 + \beta^2)) < \tfrac{1}{2}(L(\beta^1) + L(\beta^2))$. This implies that for any $\beta^1, \beta^2 \in B$, if
$\beta^1_k, \beta^2_k \ne 0$, then there must exist a positive $c$ with $\beta^1_k = c\beta^2_k$. Now define $\mathcal{K}$ as:

$$\mathcal{K} = \{k : \exists\, \beta \in B,\ \beta_k \ne 0\} .$$

Then, for each $k \in \mathcal{K}$, we can find a unique unit vector $v_k$, such that for any $\beta \in B$, $\beta_k = c v_k$
for some $c \ge 0$. The uniqueness and minimality of $\mathcal{K}$ are clear from its definition. □



Theorem 2. Define

$$Q(\alpha) = \tfrac{1}{2}\|b - A\alpha\|_2^2 + \lambda\|\alpha\|_2 ,$$

where $b \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times q}$, $\lambda > 0$, and $\alpha$ may take any value in $\mathbb{R}^q$. Let $A^T A = U^T D U$ be the
spectral decomposition, with $D = \mathrm{diag}\{d_1, d_2, \dots, d_q\}$. Define $v = U A^T b$. Then:

i. If $\|v\|_2 \le \lambda$ then $\hat\alpha = 0$ is the unique minimizer of $Q$.

ii. If $\|v\|_2 > \lambda$, then there is a unique $r \in \mathbb{R}_+$ satisfying

$$f(r) = \sum_{j=1}^{q} \frac{v_j^2}{(d_j r + \lambda)^2} = 1 . \tag{6}$$

Furthermore, if we define

$$\alpha(r) = U^T\left(D + r^{-1}\lambda I_q\right)^{-1} v \tag{7}$$

then $\hat\alpha = \alpha(r)$ is the unique minimizer of $Q$.
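
The line search in (6) and the update (7) might be sketched as follows in Python. To stay
self-contained, the sketch takes the spectral quantities $d_1, \dots, d_q$ and $v = U A^T b$ as
inputs and returns $U\hat\alpha$, which equals $\hat\alpha$ itself when $U = I$ (for example, when the
columns of $A$ are orthogonal). Bisection is used only as one simple way of solving $f(r) = 1$;
the function names and the tolerance are assumptions of this sketch.

```python
import math

def f(r, d, v, lam):
    """f(r) from equation (6): sum_j v_j^2 / (d_j * r + lam)^2."""
    return sum(vj * vj / (dj * r + lam) ** 2 for dj, vj in zip(d, v))

def solve_group(d, v, lam, tol=1e-12):
    """Return U * alpha_hat, given the eigenvalues d of A^T A and v = U A^T b.

    Assumes v_j = 0 whenever d_j = 0, so that f(r) tends to 0 as r grows.
    """
    if math.sqrt(sum(vj * vj for vj in v)) <= lam:
        return [0.0] * len(v)                  # case i: the whole group is zero
    lo, hi = 0.0, 1.0
    while f(hi, d, v, lam) > 1.0:              # f is decreasing: grow hi until f(hi) <= 1
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):        # bisection for the unique root of f(r) = 1
        mid = 0.5 * (lo + hi)
        if f(mid, d, v, lam) > 1.0:
            lo = mid
        else:
            hi = mid
    r = 0.5 * (lo + hi)
    # Equation (7) in the eigenbasis: (D + (lam / r) I)^{-1} v, coordinate by coordinate.
    return [vj / (dj + lam / r) for dj, vj in zip(d, v)]

# Orthonormal two-column example (d = [1, 1], U = I) with A^T b = (1, 1) and lam = 1.
alpha = solve_group([1.0, 1.0], [1.0, 1.0], 1.0)
```

In this example $f(r) = 2/(r+1)^2$, so $r = \sqrt{2} - 1$ and both entries of `alpha` equal
$1 - 1/\sqrt{2}$ up to the bisection tolerance.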
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Proof. The minimizer of Q is unique due to strict convexity of Q. Friedman et al. (2010) 
discuss the subgradient of $Q(\alpha)$, and the implication that $\alpha = 0$ minimizes $Q$ if and only if
$\|v\|_2 \le \lambda$; this covers the first case. Assume that the second case holds; that is, $\|v\|_2 > \lambda$.
By Friedman et al. (2010), $\alpha$ is the unique minimizer if and only if it satisfies the subgradient
equation

$$A^T A\alpha + \lambda\frac{\alpha}{\|\alpha\|_2} = A^T b . \tag{8}$$

Since $\|v\|_2 > \lambda$, referring to (6), we see that $f(0) > 1$ and $\lim_{r\to\infty} f(r) = 0$. Since $f$ is strictly
decreasing in $r$, there is a unique $r > 0$ with $f(r) = 1$. (To check that $\lim_{r\to\infty} f(r) = 0$,
we compute the singular value decomposition $A = V^T D^{1/2} U$, where $(D^{1/2})^T D^{1/2} = D$ in the
notation of the theorem. Then $v = U A^T b = (D^{1/2})^T V b$, and so for any $j$ with $d_j = 0$, $v_j = 0$
also. Therefore, $f(r)$ vanishes as $r \to \infty$.) Let $\hat\alpha = \alpha(r)$. By (6) and (7),

$$\|\hat\alpha\|_2^2 = \sum_{j=1}^{q} \frac{v_j^2}{(d_j + r^{-1}\lambda)^2} = r^2 f(r) = r^2 .$$

Therefore, we can rewrite (7) as

$$\left(D + \frac{\lambda}{\|\hat\alpha\|_2} I_q\right) U\hat\alpha = U A^T b .$$

Hence,

$$U^T\left(D + \frac{\lambda}{\|\hat\alpha\|_2} I_q\right) U\hat\alpha = \left(U^T D U + \frac{\lambda}{\|\hat\alpha\|_2} U^T U\right)\hat\alpha = A^T A\hat\alpha + \lambda\frac{\hat\alpha}{\|\hat\alpha\|_2} = A^T b .$$

This proves that $\hat\alpha$ satisfies (8) and is thus the unique minimizer of $Q$. □

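Theorem 4. Take any $\beta^* \in \mathbb{R}^p$, and any $w^* \in \partial L(\beta^*)$. Let $\hat y$ be the unique optimal vector of
fitted values. Then

$$\|X\beta^* - \hat y\|_2^2 \le 2(w^*)^T\beta^* + O(\|w^*\|_2) .$$

More precisely, the error in the estimate of the optimal fitted values is bounded as follows,
where $\hat\beta$ is any vector in $B$:

$$\|X\beta^* - \hat y\|_2^2 \le 2(w^*)^T\beta^* + 2\|w^*\|_2\,\|\hat\beta\|_2 .$$

By bounding $\|\hat\beta\|_2$, we may further obtain the following two bounds (here $\beta_{LSE}$ denotes any
unpenalized least-squares estimate minimizing $\|y - X\beta_{LSE}\|_2^2$):

$$\|X\beta^* - \hat y\|_2^2 \le 2(w^*)^T\beta^* + 2\|w^*\|_2 \times \lambda^{-1}\left(L(\beta^*) - \tfrac{1}{2}\|P_X^{\perp} y\|_2^2\right) ,$$

$$\|X\beta^* - \hat y\|_2^2 \le 2(w^*)^T\beta^* + 2\|w^*\|_2 \times \sum_{k=1}^{K} \|(\beta_{LSE})_k\|_2 .$$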



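Proof. First, take any $\beta$, any $\delta \in \mathbb{R}^p$, and any subgradient $w \in \partial L(\beta)$. Observe that

$$\tfrac{1}{2}\|y - X(\beta + \delta)\|_2^2 = \tfrac{1}{2}\|y - X\beta\|_2^2 - (y - X\beta)^T X\delta + \tfrac{1}{2}\|X\delta\|_2^2 ,$$

and from the gradient of the likelihood,

$$w \in \partial L(\beta) \iff \left(w + X^T(y - X\beta)\right) \in \partial\Big(\lambda\sum_{k=1}^{K}\|\beta_k\|_2\Big) .$$

By the definition of the subdifferential,

$$\lambda\sum_{k=1}^{K}\|(\beta + \delta)_k\|_2 \ge \lambda\sum_{k=1}^{K}\|\beta_k\|_2 + \left(w + X^T(y - X\beta)\right)^T\delta .$$

Therefore,

$$L(\beta + \delta) = \tfrac{1}{2}\|y - X(\beta + \delta)\|_2^2 + \lambda\sum_{k=1}^{K}\|(\beta + \delta)_k\|_2$$

$$\ge \tfrac{1}{2}\|y - X\beta\|_2^2 - (y - X\beta)^T X\delta + \tfrac{1}{2}\|X\delta\|_2^2 + \lambda\sum_{k=1}^{K}\|\beta_k\|_2 + \left(w + X^T(y - X\beta)\right)^T\delta$$

$$= L(\beta) + w^T\delta + \tfrac{1}{2}\|X\delta\|_2^2 .$$

Now take $w^* \in \partial L(\beta^*)$ and $\hat\beta \in \arg\min_\beta L(\beta)$. From above,

$$L(\hat\beta) \ge L(\beta^*) + (w^*)^T(\hat\beta - \beta^*) + \tfrac{1}{2}\|X(\hat\beta - \beta^*)\|_2^2 .$$

Also, by optimality of $\hat\beta$, $L(\hat\beta) \le L(\beta^*)$. Therefore,

$$\tfrac{1}{2}\|X\beta^* - \hat y\|_2^2 \le (w^*)^T(\beta^* - \hat\beta) \le (w^*)^T\beta^* + \|w^*\|_2\,\|\hat\beta\|_2 .$$

We next observe that, for any $\beta \in \mathbb{R}^p$,

$$L(\beta) \ge L(\hat\beta) \ge \tfrac{1}{2}\|P_X^{\perp} y\|_2^2 + \lambda\sum_{k=1}^{K}\|\hat\beta_k\|_2 . \tag{9}$$

Choosing $\beta = \beta^*$ and applying (9), we obtain

$$\|\hat\beta\|_2 \le \sum_{k=1}^{K}\|\hat\beta_k\|_2 \le \lambda^{-1}\left(L(\beta^*) - \tfrac{1}{2}\|P_X^{\perp} y\|_2^2\right) ,$$

which yields the next-to-last bound in the theorem. Choosing instead $\beta = \beta_{LSE}$ and again
applying (9), using the fact that $\|y - X\beta_{LSE}\|_2 = \|P_X^{\perp} y\|_2$, we obtain

$$\|\hat\beta\|_2 \le \sum_{k=1}^{K}\|\hat\beta_k\|_2 \le \lambda^{-1}\left(L(\beta_{LSE}) - \tfrac{1}{2}\|P_X^{\perp} y\|_2^2\right) = \sum_{k=1}^{K}\|(\beta_{LSE})_k\|_2 ,$$

which yields the last bound in the theorem. □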

Remark 1. If $\beta^* = \beta^{(t)}$ for some large $t$, then $L(\beta^*)$ will potentially be much lower than
$L(\beta_{LSE})$. Therefore, the next-to-last bound in the statement of the theorem will be
advantageous. It might be possible that the bound can be improved to a bound that does not include
a $\lambda^{-1}$ term, but we have not been able to prove this.
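
As a small sanity check of how the last two bounds of Theorem 4 could be evaluated without
knowing $\hat\beta$, the following Python sketch treats the scalar problem
$L(\beta) = \tfrac{1}{2}(y - \beta)^2 + \lambda|\beta|$ (one group, one covariate, $X = 1$). In this toy case,
which is an assumption of the sketch and not an experiment from the paper, $P_X^{\perp} y = 0$,
$\beta_{LSE} = y$, and the exact minimizer is available by soft-thresholding, so the bounds can
be compared with the true error. The function names are hypothetical.

```python
def accuracy_bounds(y, lam, beta_star):
    """Next-to-last and last bounds of Theorem 4 for L(b) = 0.5*(y - b)^2 + lam*|b|."""
    assert beta_star != 0.0                       # keeps the subgradient single-valued
    sign = 1.0 if beta_star > 0 else -1.0
    w = -(y - beta_star) + lam * sign             # the only element of dL(beta_star)
    L_star = 0.5 * (y - beta_star) ** 2 + lam * abs(beta_star)
    # X = 1 spans R^1, so the least-squares residual is zero and beta_LSE = y.
    next_to_last = 2 * w * beta_star + 2 * abs(w) * L_star / lam
    last = 2 * w * beta_star + 2 * abs(w) * abs(y)
    return next_to_last, last

def exact_error(y, lam, beta_star):
    """Squared error of the fitted value X * beta_star against the optimal fitted value."""
    sign = 1.0 if y > 0 else -1.0
    beta_hat = sign * max(abs(y) - lam, 0.0)      # soft-thresholding gives the minimizer
    return (beta_star - beta_hat) ** 2

# Toy values: y = 2, lam = 1, and a trial point beta_star = 1.3 near the optimum 1.
bounds = accuracy_bounds(2.0, 1.0, 1.3)
error = exact_error(2.0, 1.0, 1.3)
```

Both entries of `bounds` are computed from $\beta^*$, $w^*$ and the data alone, which is what
makes bounds of this kind usable as a stopping criterion.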

Theorem 5. Define

$$Q_1(\alpha) = \tfrac{1}{2}\|b - A\alpha\|_2^2 + \lambda_1\|\alpha\|_2 + \lambda_2\|\alpha\|_1 ,$$

where $b \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times q}$, $\lambda_1, \lambda_2 > 0$, and $\alpha$ may take any value in $\mathbb{R}^q$. Then:

i. Suppose $\|\{A^T b\}_{\lambda_2}\|_2 \le \lambda_1$. Then $\hat\alpha = 0$ is the unique minimizer of $Q_1$.

ii. Suppose $\|\{A^T b\}_{\lambda_2}\|_2 > \lambda_1$. For any vector of signs $s \in \{-1,0,1\}^q$, write
$J = \{j : s_j \ne 0\}$, and let $A_J^T A_J = U_J^T D_J U_J$ be the spectral decomposition, with
$D_J = \mathrm{diag}\{d_1^J, \dots, d_{|J|}^J\}$. Define also $v^s = U_J(A_J^T b - \lambda_2 s_J)$ and

$$f_s(r) = \sum_{j=1}^{|J|} \frac{(v_j^s)^2}{(d_j^J r + \lambda_1)^2} . \tag{10}$$

Define $\hat s \in \{-1,0,1\}^q$ to be the (unknown) vector of signs of the true optimal solution
$\hat\alpha = \arg\min_\alpha Q_1(\alpha)$. Let $s$ be any vector of signs in $\{-1,0,1\}^q$. Then:

a. If $s = \hat s$, there will be exactly one $r$ satisfying $f_s(r) = 1$. Furthermore, if we define

$$\alpha_J(r) = U_J^T\left(D_J + r^{-1}\lambda_1 I_{|J|}\right)^{-1} v^s \quad\text{and}\quad \alpha(r)_j = \begin{cases} (\alpha_J(r))_j, & j \in J \\ 0, & j \notin J , \end{cases} \tag{11}$$

then the following feasibility conditions will be satisfied:

$$\mathrm{sign}(\alpha(r)) = s \quad\text{and}\quad \forall\, j \notin J,\ \left|\{A_j^T(b - A\alpha)\}_{\lambda_2}\right| \le \lambda_1 . \tag{12}$$

Moreover, $\hat\alpha = \alpha(r)$ is the unique minimizer of $Q_1$.

b. If instead $s \ne \hat s$, then either $f_s(r) = 1$ will have no solutions, or it will have one
solution $r$ with $\alpha(r)$ failing the feasibility conditions.



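Proof. The question of whether $\hat\alpha = 0$ is optimal is directly covered in Friedman et al. (2010).
Now assume $\hat\alpha = 0$ is not optimal. For $\alpha \ne 0$, by Friedman et al. (2010), we have:

$$\partial Q_1(\alpha) = -A^T b + A^T A\alpha + \lambda_1\frac{\alpha}{\|\alpha\|_2} + \lambda_2 t , \tag{13}$$

where $t_j = \mathrm{sign}(\alpha_j)$ if $\alpha_j \ne 0$, and may equal any number in $[-1, 1]$ if $\alpha_j = 0$.

First we examine the true solution $\hat\alpha$, with its sign vector $\hat s$; define $U_J$, $D_J$, $v^{\hat s}$, $f_{\hat s}$ as in the
statement of the theorem. We see that

$$0 = (\partial Q_1(\hat\alpha))_J = -A_J^T b + A_J^T A\hat\alpha + \lambda_1\left(\frac{\hat\alpha}{\|\hat\alpha\|_2}\right)_J + \lambda_2 t_J = -A_J^T b + A_J^T A_J\hat\alpha_J + \lambda_1\frac{\hat\alpha_J}{\|\hat\alpha_J\|_2} + \lambda_2\hat s_J .$$

It follows that

$$\left(A_J^T A_J + \frac{\lambda_1}{\|\hat\alpha_J\|_2} I_{|J|}\right)\hat\alpha_J = A_J^T b - \lambda_2\hat s_J .$$

Next, as discussed in the proof of Theorem 2, for any $s$, $f_s$ is strictly decreasing in $r$ with
$\lim_{r\to\infty} f_s(r) = 0$. Therefore, there is at most one $r > 0$ with $f_s(r) = 1$.

Next we check that setting $r = \|\hat\alpha_J\|_2 = \|\hat\alpha\|_2$ must satisfy $f_{\hat s}(r) = 1$. Indeed,

$$f_{\hat s}(r) = \sum_{j=1}^{|J|}\frac{(v_j^{\hat s})^2}{(d_j^J r + \lambda_1)^2} = \frac{1}{r^2}\sum_{j=1}^{|J|}\frac{(v_j^{\hat s})^2}{(d_j^J + \lambda_1 r^{-1})^2} = \frac{1}{r^2}\left\|\left(D_J + r^{-1}\lambda_1 I_{|J|}\right)^{-1} v^{\hat s}\right\|_2^2 = \frac{1}{r^2}\|\hat\alpha\|_2^2 = 1 .$$

And, by definition, we would then have $\hat\alpha = \alpha(r)$. Furthermore, $\mathrm{sign}(\hat\alpha) = \hat s$ by assumption.
Finally, for all $j \notin J$, since $\hat\alpha_j = 0$ and $\hat\alpha$ is the optimal solution, by (13),

$$-A_j^T(b - A\hat\alpha) = -A_j^T b + A_j^T A\hat\alpha + \lambda_1\frac{\hat\alpha_j}{\|\hat\alpha\|_2} \in \lambda_2 \times [-1, 1] .$$

Therefore, feasibility conditions (12) hold.

Conversely, take any arbitrary sign vector $s \in \{-1,0,1\}^q$ and define $U_J$, $D_J$, $v^s$, $f_s$ as stated
above. Suppose some $r > 0$ satisfies $f_s(r) = 1$, and define $\alpha = \alpha(r)$; suppose furthermore
that the feasibility conditions (12) hold. Then by (10) and (11), proceeding as in Theorem 2,
we see that $\|\alpha\|_2^2 = r^2$, and

$$\left(A_J^T A_J + \frac{\lambda_1}{\|\alpha\|_2} I_{|J|}\right)\alpha_J = A_J^T b - \lambda_2 s_J = A_J^T b - \lambda_2\,\mathrm{sign}(\alpha_J) = A_J^T b - \lambda_2 t_J .$$

Moreover, for all $j \notin J$, since $\alpha_j = 0$ and the feasibility conditions (12) are satisfied, we have
that

$$t_j := -\lambda_2^{-1}\left(-A_j^T b + A_j^T A\alpha + \lambda_1\frac{\alpha_j}{\|\alpha\|_2}\right) \in [-1, 1] .$$

Therefore, with this definition of $t$, we see that $\alpha = \alpha(r)$ gives a zero subgradient for $Q_1$ in (13);
therefore $\alpha$ is the unique minimizer of $Q_1(\alpha)$. Therefore, $\alpha(r) = \hat\alpha$ and $s = \mathrm{sign}(\alpha(r)) =
\mathrm{sign}(\hat\alpha) = \hat s$. □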